What is claimed is:

1. A method for uniform representation of a subject genome sequence comprising
the steps of:

providing a set of known biological fragments, the set being of a
predetermined number of said known biological fragments;

comparing each known biological fragment from the set to a subject
genome sequence, for each known biological fragment said comparing including
(i) counting the number of times the known biological fragment is found in the
subject genome sequence and (ii) from said counted number of times, forming a
vector element such that for each known biological fragment there is a
respective vector element representing the number of times that known
biological fragment is found in the subject genome sequence; and

from the formed vector elements, forming a vector having a length equal
to the predetermined number of known biological fragments in the provided set,
such that the formed vector provides a fixed length representation of the subject
genome sequence.
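The counting and vector-forming steps of Claim 1 can be illustrated with a minimal sketch. The claim does not specify how a fragment is "found", so the sketch assumes exact string matching and counts overlapping occurrences; the function names and the sample fragments are hypothetical and serve only as illustration.

```python
from typing import List, Sequence


def count_occurrences(fragment: str, sequence: str) -> int:
    """Count occurrences of fragment in sequence (assumption: exact match, overlaps counted)."""
    if not fragment:
        return 0
    count = 0
    start = sequence.find(fragment)
    while start != -1:
        count += 1
        # Advance by one position so that overlapping matches are also counted.
        start = sequence.find(fragment, start + 1)
    return count


def fragment_count_vector(fragments: Sequence[str], sequence: str) -> List[int]:
    """Form one vector element per known fragment; the length equals len(fragments)."""
    return [count_occurrences(fragment, sequence) for fragment in fragments]


# Hypothetical fragment set of predetermined size 3.
fragments = ["ACG", "TT", "GGA"]
vector = fragment_count_vector(fragments, "ACGTTTACGGA")
print(vector)  # [2, 2, 1]
```

Whatever the length of the subject sequence, the resulting vector has as many elements as there are fragments in the set.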

2. A method as claimed in Claim 1 wherein the set of known biological fragments 
is from published databases of motifs or proteins. 

3. A method as claimed in Claim 1 further comprising the step of:

for each desired subject genome sequence, repeating the comparing and
forming steps such that a respective vector representation is formed and each
desired subject genome sequence has a same length vector representation.

4. A method as claimed in Claim 3 wherein for each subject genome sequence,
having formed respective vector representations each of the same length, using
the same length vector representation as input into one or more sequence
analyses.

5. A method as claimed in Claim 4 wherein the sequence analyses include one of
indexing, classification and clustering.
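Because every subject sequence maps to a vector of the same length, standard vector-based analyses apply directly. The following is a minimal sketch of one such analysis, a nearest-neighbour classification under Euclidean distance. The choice of distance, the function names, and the labels `family_A` and `family_B` are assumptions made for illustration and are not taken from the claims.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of the same length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_label(query: Sequence[float],
                  labelled: List[Tuple[Sequence[float], str]]) -> str:
    """Return the label of the labelled vector closest to the query vector."""
    _, best_label = min(labelled, key=lambda item: euclidean(query, item[0]))
    return best_label


# Hypothetical labelled fixed-length vectors (three fragments in the set).
labelled = [([2, 2, 1], "family_A"), ([0, 5, 0], "family_B")]
print(nearest_label([1, 2, 1], labelled))  # family_A
```

The length check in `euclidean` reflects the point of the fixed-length representation: vectors formed from the same fragment set are always directly comparable.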

6. A method as claimed in Claim 1 wherein the subject genome sequence is a
protein sequence or subsequence.

7. A method as claimed in Claim 1 wherein the subject genome sequence is a DNA
sequence or subsequence. 


8. A method as claimed in Claim 1 wherein the counting includes determining
a probability of the subject genome sequence being generated by the known
biological fragment.


9. A method as claimed in Claim 8 wherein the counting determining probability
employs a 0th order Markov model for each known biological fragment.
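A 0th order Markov model treats each symbol as independent of its neighbours, so the probability of a sequence is the product of its per-symbol probabilities. The sketch below shows one way such a score could be computed per fragment. It assumes the symbol probabilities are estimated from symbol frequencies in the fragment, uses add-one smoothing to avoid zero probabilities, and works in log space; these choices, the DNA alphabet default, and all names are assumptions for illustration and are not stated in the claims.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def zeroth_order_model(fragment: str, alphabet: str,
                       pseudocount: float = 1.0) -> Dict[str, float]:
    """Estimate independent symbol probabilities from a fragment (assumption: add-one smoothing)."""
    counts = Counter(fragment)
    total = sum(counts[s] for s in alphabet) + pseudocount * len(alphabet)
    return {s: (counts[s] + pseudocount) / total for s in alphabet}


def log_probability(sequence: str, model: Dict[str, float]) -> float:
    """Log-probability of the sequence under the model; raises KeyError for unknown symbols."""
    return sum(math.log(model[symbol]) for symbol in sequence)


def probability_vector(fragments: Sequence[str], sequence: str,
                       alphabet: str = "ACGT") -> List[float]:
    """One log-probability score per known fragment, giving a fixed-length vector."""
    return [log_probability(sequence, zeroth_order_model(f, alphabet))
            for f in fragments]


# Hypothetical fragments; the subject "AAC" scores higher under the A-rich fragment.
scores = probability_vector(["AAAA", "CCCC"], "AAC")
print(scores[0] > scores[1])  # True
```

Log-probabilities are used because the product of many small probabilities underflows for long sequences.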

10. Apparatus for forming uniform representations of genome sequences,
comprising:

a data store of a predefined number of known biological sequences;

a comparison routine executed by a digital processor having access to the
data store, the comparison routine comparing each known biological sequence
from the data store to a subject genome sequence and generating a score
indicative of the comparison, said scores forming a vector having a length equal
to the predefined number of known biological sequences, such that said
comparison routine outputs the formed vector as a fixed length representation of
the subject genome sequence.

11. Apparatus as claimed in Claim 10 wherein the data store is a published database
of motifs or proteins. 


12. Apparatus as claimed in Claim 10 further comprising a plurality of different
subject genome sequences; and

wherein the comparison routine forms for each subject genome sequence,
a respective vector such that a corresponding plurality of same length vector
representations is provided.

13. Apparatus as claimed in Claim 12 wherein the output of the comparison routine
feeds the corresponding plurality of same length vector representations into
further analysis processors.

14. Apparatus as claimed in Claim 13 wherein the further analysis processors 
include at least one of a classifier, an indexer and a clustering member. 


15. Apparatus as claimed in Claim 10 wherein the subject genome sequence is a
protein sequence or subsequence.

16. Apparatus as claimed in Claim 10 wherein the subject genome sequence is a 
DNA sequence or subsequence. 

17. Apparatus as claimed in Claim 10 wherein the generated score is a probability of
the subject genome sequence being generated by the known biological sequence.

18. Apparatus as claimed in Claim 10 wherein the generated score is a counting of a
number of occurrences of the known biological sequence found in the subject
genome sequence.




